
    Scheduling Data Flow Program in XKaapi: A New Affinity Based Algorithm for Heterogeneous Architectures

    Efficient implementations of parallel applications on heterogeneous hybrid architectures require a careful balance between computations and communications with accelerator devices. Even if most of the communication time can be overlapped by computations, it is essential to reduce the total volume of communicated data. The literature therefore abounds with ad hoc methods to reach that balance, but they are architecture- and application-dependent. We propose a generic mechanism to automatically optimize the scheduling between CPUs and GPUs, and compare two strategies within this mechanism: the classical Heterogeneous Earliest Finish Time (HEFT) algorithm and our new, parametrized Distributed Affinity Dual Approximation algorithm (DADA), which groups tasks by affinity before running a fast dual approximation. We ran experiments on a heterogeneous parallel machine with six CPU cores and eight NVIDIA Fermi GPUs, using three standard dense linear algebra kernels from the PLASMA library ported on top of the XKaapi runtime. The results show that both HEFT and DADA perform well under various experimental conditions, but that DADA performs better for larger problem sizes and numbers of GPUs and, in most cases, generates far less data transfer than HEFT to achieve the same performance.
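
    A minimal C++ sketch of the earliest-finish-time rule described above: each task is placed on the CPU or GPU worker that minimizes its estimated finish time, with a transfer penalty added for GPU placement (the cost that DADA reduces by grouping tasks by affinity). The worker and task names and the cost model are illustrative assumptions; this is not the XKaapi or DADA implementation.

        // heft_sketch.cpp -- illustrative only; not the XKaapi/DADA scheduler.
        #include <cstddef>
        #include <cstdio>
        #include <limits>
        #include <string>
        #include <vector>

        struct Worker {
            std::string name;
            double ready_time;   // when this worker becomes free
        };

        struct Task {
            std::string name;
            double cpu_cost, gpu_cost;   // estimated execution times (assumed known)
            double transfer_cost;        // extra data movement if run on a GPU
        };

        // HEFT rule: pick the worker that minimizes the task's earliest finish time.
        std::size_t pick_worker(const std::vector<Worker>& workers, const Task& t) {
            std::size_t best = 0;
            double best_finish = std::numeric_limits<double>::max();
            for (std::size_t i = 0; i < workers.size(); ++i) {
                const bool is_gpu = workers[i].name.rfind("gpu", 0) == 0;
                const double exec = is_gpu ? t.gpu_cost + t.transfer_cost : t.cpu_cost;
                const double finish = workers[i].ready_time + exec;
                if (finish < best_finish) { best_finish = finish; best = i; }
            }
            return best;
        }

        int main() {
            std::vector<Worker> workers = {{"cpu0", 0}, {"cpu1", 0}, {"gpu0", 0}};
            std::vector<Task> tasks = {{"potrf", 4.0, 1.0, 2.0},
                                       {"trsm", 3.0, 0.5, 2.0},
                                       {"gemm", 6.0, 0.5, 2.0}};
            for (const Task& t : tasks) {
                const std::size_t i = pick_worker(workers, t);
                const bool is_gpu = workers[i].name.rfind("gpu", 0) == 0;
                workers[i].ready_time += is_gpu ? t.gpu_cost + t.transfer_cost : t.cpu_cost;
                std::printf("%s -> %s (finish %.1f)\n", t.name.c_str(),
                            workers[i].name.c_str(), workers[i].ready_time);
            }
        }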

    Fibers are not (P)Threads: The Case for Loose Coupling of Asynchronous Programming Models and MPI Through Continuations

    Asynchronous programming models (APMs) are gaining more and more traction, allowing applications to expose the available concurrency to a runtime system tasked with coordinating the execution. While MPI has long provided support for multi-threaded communication and non-blocking operations, it falls short of adequately supporting APMs, as correctly and efficiently handling MPI communication in different models is still a challenge. Meanwhile, new low-level implementations of lightweight, cooperatively scheduled execution contexts (fibers, also known as user-level threads, ULTs) are meant to serve as a basis for higher-level APMs, and their integration into MPI implementations has been proposed as a replacement for traditional POSIX thread support to alleviate these challenges. In this paper, we first establish a taxonomy in an attempt to clearly distinguish different concepts in the parallel software stack. We argue that the proposed tight integration of fiber implementations with MPI is neither warranted nor beneficial, and instead is detrimental to the goal of MPI being a portable communication abstraction. We propose MPI Continuations as an extension to the MPI standard that provides callback-based notifications on completed operations, leading to a clear separation of concerns through a loose coupling mechanism between MPI and APMs. We show that this interface is flexible and interacts well with different APMs, namely OpenMP detached tasks, OmpSs-2, and Argobots.
    Comment: 12 pages, 7 figures. Published in the proceedings of EuroMPI/USA '20, September 21-24, 2020, Austin, TX, USA.
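
    To make the callback-on-completion idea concrete, the sketch below is a user-level emulation built only from standard MPI calls (MPI_Isend, MPI_Irecv, MPI_Test): a continuation is registered for a nonblocking request and fired from a progress loop that a task runtime could invoke between tasks. The Continuation struct and the attach/progress helpers are hypothetical; the paper's actual proposal is an MPI standard extension and is not reproduced here.

        // continuation_sketch.cpp -- user-level emulation of completion callbacks;
        // not the MPI Continuations interface proposed in the paper.
        #include <mpi.h>
        #include <cstddef>
        #include <cstdio>
        #include <functional>
        #include <vector>

        struct Continuation {
            MPI_Request request;
            std::function<void()> callback;   // runs once the operation completes
        };

        static std::vector<Continuation> pending;

        // Register a callback to run when the nonblocking operation completes.
        void attach(MPI_Request req, std::function<void()> cb) {
            pending.push_back({req, std::move(cb)});
        }

        // Poll outstanding requests and fire callbacks for completed ones.
        void progress() {
            for (std::size_t i = 0; i < pending.size();) {
                int done = 0;
                MPI_Test(&pending[i].request, &done, MPI_STATUS_IGNORE);
                if (done) {
                    pending[i].callback();
                    pending.erase(pending.begin() + i);
                } else {
                    ++i;
                }
            }
        }

        int main(int argc, char** argv) {
            MPI_Init(&argc, &argv);
            int rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            int value = rank;
            MPI_Request req;
            if (rank == 0) {
                MPI_Irecv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
                attach(req, [&] { std::printf("received %d; resume the waiting task here\n", value); });
            } else if (rank == 1) {
                MPI_Isend(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
                attach(req, [] { std::printf("send completed\n"); });
            }
            while (!pending.empty()) progress();   // an APM runtime would call this between tasks
            MPI_Finalize();
        }

    Run with two ranks (for example, mpirun -np 2) to see both callbacks fire.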

    Correlated Set Coordination in Fault Tolerant Message Logging Protocols

    Based on our current expectations for exascale systems, composed of hundreds of thousands of many-core nodes, the mean time between failures will become small even under the most optimistic assumptions. One of the most scalable checkpoint-restart techniques, the message logging approach, is the most challenged when the number of cores per node increases, due to the high overhead of saving the message payload. Fortunately, for two processes on the same node the failure probability is correlated, meaning that coordinated recovery is free. In this paper, we propose an intermediate approach that uses coordination between correlated processes but retains the scalability advantage of message logging between independent ones. The algorithm still belongs to the family of event logging protocols, but eliminates the need for costly payload logging between coordinated processes.
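
    A short sketch of the decision the protocol hinges on, assuming that ranks on the same node form one correlated set: ordering events are always logged so that message delivery can be replayed deterministically, but payload logging is skipped when sender and receiver belong to the same set. The fixed ranks-per-node mapping and the function names are illustrative assumptions, not the paper's implementation.

        // correlated_logging_sketch.cpp -- illustrative decision logic only.
        #include <cstdio>

        // Ranks on the same node fail together, so they form one correlated set.
        // The set id is derived here from an assumed fixed number of ranks per node.
        constexpr int kRanksPerNode = 8;   // hypothetical mapping, for illustration

        int correlated_set(int rank) { return rank / kRanksPerNode; }

        // Payload logging is only needed when sender and receiver can fail independently.
        bool must_log_payload(int sender, int receiver) {
            return correlated_set(sender) != correlated_set(receiver);
        }

        int main() {
            std::printf("0 -> 3: log payload? %d (same correlated set)\n", must_log_payload(0, 3));
            std::printf("0 -> 9: log payload? %d (different sets)\n", must_log_payload(0, 9));
        }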

    Recovery Patterns for Iterative Methods in a Parallel Unstable Environment

    Several recovery techniques for parallel iterative methods are presented. First, the implementation of checkpoints in parallel iterative methods is described and analyzed. Then, a simple checkpoint-free fault-tolerant scheme for parallel iterative methods, the lossy approach, is presented. When one processor fails and all its data is lost, the system is recovered by computing a new approximate solution using the data of the non-failed processors; the iterative method is then restarted with this new vector. The main advantage of the lossy approach over standard checkpoint algorithms is that it does not increase the computational cost of the iterative solver when no failure occurs. Experiments comparing the different techniques are presented, using the fault-tolerant FT-MPI library. Both iterative linear solvers and eigensolvers are considered.
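
    One common way to write the lossy recovery step for a block row-partitioned system Ax = b (the paper's exact formulation may differ): if processor p fails and its block x_p of the current iterate is lost, a replacement block is obtained from the surviving blocks by solving the local system

        A_{pp} \tilde{x}_p = b_p - \sum_{q \neq p} A_{pq} x_q ,

    and the iterative method is restarted from the vector whose p-th block is \tilde{x}_p, with the other blocks left unchanged.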

    A Current Task-Based Programming Paradigms Analysis

    Task-based paradigm models can be an alternative to MPI. The user defines atomic tasks with defined inputs and outputs, along with the dependencies between them. The runtime can then schedule the tasks and data migrations efficiently over all the available cores while reducing the waiting time between tasks. This paper focuses on comparing several task-based programming models with one another, using the LU factorization as a benchmark. HPX, PaRSEC, Legion and YML+XMP are task-based programming models which schedule data movement and computational tasks on distributed resources allocated to the application. YML+XMP supports parallel and distributed tasks with XcalableMP, a PGAS language. Their performance and scalability are compared to ScaLAPACK, a highly optimized library which uses MPI to perform communications between the processes, on up to 64 nodes. We performed a block-based LU factorization with the task-based programming models on matrices of size up to 49512 × 49512. HPX performs better than PaRSEC, Legion and YML+XMP, but not better than ScaLAPACK. YML+XMP has better scalability than HPX, Legion and PaRSEC. Regent has trouble scaling from 32 nodes to 64 nodes with our algorithm.
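
    A scalar-granularity C++ sketch of the block LU structure that these runtimes turn into a task graph: the bodies below correspond to the GETRF, TRSM and GEMM tile kernels, and in a task-based model each kernel invocation becomes a task whose input and output tiles define the dependencies. The sketch runs sequentially, uses no pivoting, and is not taken from any of the compared frameworks.

        // lu_task_sketch.cpp -- sequential sketch of the task structure of a
        // right-looking LU factorization without pivoting.
        #include <cstdio>
        #include <vector>

        int main() {
            const int n = 4;
            // Diagonally dominant so that LU without pivoting is stable.
            std::vector<std::vector<double>> A = {
                {10, 2, 3, 1},
                { 4, 12, 1, 2},
                { 2, 1, 9, 3},
                { 1, 3, 2, 11},
            };
            for (int k = 0; k < n; ++k) {
                // GETRF(k): factor the diagonal block (trivial at scalar granularity).
                for (int i = k + 1; i < n; ++i)
                    A[i][k] /= A[k][k];                   // TRSM task: depends on GETRF(k)
                for (int i = k + 1; i < n; ++i)
                    for (int j = k + 1; j < n; ++j)
                        A[i][j] -= A[i][k] * A[k][j];     // GEMM task: depends on the TRSMs
            }
            // A now holds L (unit lower) and U (upper) packed in place.
            for (int i = 0; i < n; ++i) {
                for (int j = 0; j < n; ++j) std::printf("%7.3f ", A[i][j]);
                std::printf("\n");
            }
        }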

    Analysis of the Component Architecture Overhead

    Component architectures provide a useful framework for developing an extensible and maintainable code base upon which large-scale software projects can be built. Component methodologies have only recently been incorporated into applications by the High Performance Computing community, in part because of the perception that component architectures necessarily incur an unacceptable performance penalty. The Open MPI project is creating a new implementation of the Message Passing Interface standard based on a custom component architecture, the Modular Component Architecture (MCA), to enable straightforward customization of a high-performance MPI implementation. This paper reports on a detailed analysis of the performance overhead in Open MPI introduced by the MCA. We compare the MCA-based implementation of Open MPI with a modified version that bypasses the component infrastructure. The overhead of the MCA is shown to be low, on the order of 1%, for both latency and bandwidth microbenchmarks as well as for the NAS Parallel Benchmark suite.
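
    A small C++ illustration of the kind of indirection whose cost the paper measures: a component exports its entry points through a table of function pointers selected at run time, compared with the direct call a monolithic build would make. The btl_module name and the send signature are loosely MCA-flavoured but hypothetical; they are not the Open MPI interfaces.

        // component_call_sketch.cpp -- illustrates component-style indirection;
        // not the actual Open MPI MCA framework.
        #include <cstdio>

        // A component exports its entry points through a function-pointer table,
        // so the core library can select an implementation at run time.
        struct btl_module {               // hypothetical, loosely MCA-flavoured
            int (*send)(const void* buf, int len);
        };

        static int tcp_send(const void* /*buf*/, int len) {
            std::printf("tcp component sending %d bytes\n", len);
            return 0;
        }

        // Direct call: what a build that bypasses the component layer would do.
        static int direct_send(const void* buf, int len) { return tcp_send(buf, len); }

        int main() {
            btl_module selected = {&tcp_send};   // chosen by the framework at startup
            char payload[64] = {0};
            // The measured overhead is essentially this extra pointer indirection
            // (plus parameter lookup at selection time), shown here side by side.
            selected.send(payload, static_cast<int>(sizeof payload));
            direct_send(payload, static_cast<int>(sizeof payload));
        }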